feat: Drop raw schema from get-dataset; stop nudging schema tool #986
Merged
Calibration outcome for #882 (probe: 10 top store Actors; Mixpanel: the get-dataset-schema tool is rarely called):

- get-dataset no longer returns the raw Apify dataset.schema (93–95% of response bytes on top Actors, 23–39% phantom fields). The flat `fields` list it already returns is the complete, projection-ready inventory.
- The terminal get-dataset-items nextStep no longer nudges toward the context-heavy get-dataset-schema; it points at get-dataset for the field list instead (keeping the #1007 loaded-tool gating).
- get-dataset-schema stays as an on-demand tool, unchanged.

Note for apify-mcp-server-internal: get-dataset structuredContent no longer carries the schema key.

https://claude.ai/code/session_01Sf9wACoa9h9y2m2WZ2Sde5
RobertCrupa approved these changes on Jun 29, 2026
RobertCrupa added a commit that referenced this pull request on Jun 29, 2026:
## What
Nudges the `get-dataset-items` tool description to pass `fields` when
only specific columns are needed, to reduce response size.
The hint is framed as optional ("when you only need specific columns")
so the model does not under-fetch data the user actually asked for. It
points only at `get-dataset` for discovering field names — a lightweight
metadata call — not `get-dataset-schema`, which is often large and would
blow out the context (the opposite of the goal here; see #986).
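
As a rough illustration of the flow this hint encourages (discover names via `get-dataset`, then pass only the needed ones to `get-dataset-items`), here is a minimal sketch. The `GetDatasetItemsArgs` shape, the `buildItemsArgs` helper, and the comma-separated encoding of `fields` are assumptions made for the example, not the server's confirmed input schema.

```typescript
// Hypothetical argument shape; the real get-dataset-items input may differ.
interface GetDatasetItemsArgs {
  datasetId: string;
  limit: number;
  fields?: string; // ASSUMPTION: comma-separated dot-notation paths
}

// Build arguments for get-dataset-items, keeping only the requested columns
// that actually appear in the `fields` list returned by get-dataset.
function buildItemsArgs(
  datasetId: string,
  availableFields: string[],
  wanted: string[],
  limit = 20,
): GetDatasetItemsArgs {
  const available = new Set(availableFields);
  const selected = wanted.filter((field) => available.has(field));
  // No usable selection: omit `fields` so full rows come back (no under-fetching).
  if (selected.length === 0) return { datasetId, limit };
  return { datasetId, limit, fields: selected.join(",") };
}
```

Omitting `fields` when nothing matches mirrors the "optional" framing above: the projection is an optimization, not a requirement.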
## Why
Takeaway from the workflow eval work: dataset responses are often large,
and selecting only the needed columns meaningfully cuts response size
when the full row isn't required.
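
To make the size argument concrete, the sketch below projects an item down to a few dot-notation paths and compares serialized lengths. The helpers (`getPath`, `projectItem`, `jsonLength`) and the sample item are invented for illustration; the actual projection happens server-side and may shape its output differently.

```typescript
type Item = Record<string, unknown>;

// Read a dot-notation path such as "location.lat" from a nested object.
function getPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Keep only the requested paths, using the path string itself as the output key.
function projectItem(item: Item, fields: string[]): Item {
  const out: Item = {};
  for (const field of fields) {
    const value = getPath(item, field);
    if (value !== undefined) out[field] = value;
  }
  return out;
}

// Serialized length in characters, used here as a rough proxy for response size.
function jsonLength(value: unknown): number {
  return JSON.stringify(value).length;
}

// Made-up sample row: two small columns plus one bulky nested column.
const sample: Item = {
  title: "Cafe Example",
  city: "Prague",
  location: { lat: 50.08, lng: 14.43 },
  reviews: Array.from({ length: 50 }, (_, i) => ({ text: `review ${i}` })),
};

const projected = projectItem(sample, ["title", "location.lat"]);
console.log(jsonLength(sample), jsonLength(projected));
```

The saving depends entirely on how much of each row the unselected columns occupy, which is why the hint is framed as optional.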
## Verification
- `pnpm run type-check` — clean
- `pnpm run lint` — clean
- `pnpm run test:unit` — clean
- `pnpm run format` — clean
- `pnpm run check:agents` — clean
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #882
## What

- `get-dataset` stops returning the raw Apify `schema`. The flat `fields` list it already returns is the complete, projection-ready field inventory.
- Nudges toward `get-dataset-schema` (a context-heavy call) are dropped from `get-dataset`'s description and from `get-dataset-items`' last-page `nextStep` (which now points at `get-dataset` for the field list). This keeps the loaded-tool gating added in #1007 ([Bug]: Fix `abort-actor-run` and `get-dataset-items` from external review); it just redirects the gated reference from `get-dataset-schema` to `get-dataset`.
- `get-dataset-schema` is unchanged: still available on demand, returns the full inferred schema (no depth cap).

## Why

Two findings drove this:

1. `dataset.schema` is 93–95% of the `get-dataset` response bytes on top Actors and declares 23–39% phantom fields (disabled add-ons, etc.). Meanwhile the flat `fields` list already carries every projectable path, including deep ones (`media.videoDeliveryLegacyFields.dash_manifest_url`), and is exactly the format `get-dataset-items` `fields=` consumes.
2. `get-dataset-schema` is rarely called. So the win isn't capping its output; it's (a) dropping the redundant raw schema from the tool people do use, and (b) not steering the model into a heavy call it seldom needs. `fields` is the one canonical description of dataset shape.

The tradeoff: the model no longer sees types/formats up front. It learns them by reading an item (which it does anyway), or by calling `get-dataset-schema` explicitly.

## Effect

`get-dataset` on a `compass/crawler-google-places` dataset: 41 KB → 2.9 KB, key set identical minus `schema`.

Dropped raw `schema` BEFORE (excerpt): complete `views` + first of 94 `fields` (35 KB total)

```json
{
  "actorSpecification": 1,
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": ["title", "totalScore", "reviewsCount", "street", "city", "website", "phone", "url"]
      },
      "display": {
        "component": "table",
        "properties": { "title": { "label": "Place name" } }
      }
    },
    "leadsEnrichment": {
      "title": "Lead Enrichment",
      "transformation": {
        "fields": ["title", "firstName", "lastName", "email", "jobTitle", "companyName", "leadsEnrichment"],
        "unwind": ["leadsEnrichment"]
      },
      "display": { "component": "table" }
    }
  },
  "fields_excerpt": {
    "orderOnline": {
      "type": "object",
      "properties": {
        "pickUps": { "type": "array", "items": { "$ref": "#/definitions/PickUpItem" }, "description": "Pickup options" },
        "deliveries": { "type": "array", "items": { "$ref": "#/definitions/DeliveryItem" }, "description": "Delivery options" }
      },
      "description": "Online ordering options. Omitted unless `scrapeOrderOnline` is enabled.",
      "examples": [{ "pickUps": [{ "name": "Pickup", "orderUrl": "https://example.com/pickup" }] }]
    }
  }
}
```

Complete `get-dataset` response AFTER (2.9 KB, no `schema` key; secret redacted)

```jsonc
{
  "id": "gUWSHSZ1fALKL1pgB",
  "name": null,
  "userId": "…",
  "itemCount": 3,
  "cleanItemCount": 3,
  "actId": "nwua9Gu5YrADL7ZDj",
  "actRunId": "xoPvwNjdbrkWZUSZH",
  "stats": { "storageBytes": 2711, "readCount": 23, "writeCount": 3 },
  "fields": [
    "title", "categoryName", "address", "city", "countryCode", "phone", "website", "url",
    "totalScore", "reviewsCount", "placeId", "location.lat", "location.lng",
    "openingHours.day", "openingHours.hours",
    "additionalInfo.Service options.Delivery",
    "additionalInfo.Service options.Dine-in",
    "additionalInfo.Payments.Credit cards"
    /* …85 paths total, all dot-notation, projection-ready */
  ],
  "consoleUrl": "https://console.apify.com/storage/datasets/gUWSHSZ1fALKL1pgB",
  "generalAccess": "FOLLOW_USER_SETTING",
  "summary": "Dataset 'gUWSHSZ1fALKL1pgB' has 3 items, 85 fields.",
  "nextStep": "Use get-dataset-items with datasetId=gUWSHSZ1fALKL1pgB and limit (for example 20) to fetch items."
}
```

## Tests

- `get-dataset-items` terminal `nextStep` points at `get-dataset` (the #1007 gating tests updated accordingly); `get-dataset` field normalization unchanged. Full suite: 968 passed.
- `get-dataset` has no `schema` key and `nextStep` mentions only items; `get-dataset-items` routes to `get-dataset`; `get-dataset-schema` still returns the full schema on demand.

## Not in scope

For wide datasets the flat `fields` list is itself large (~307 paths for a Facebook-posts dataset). Capping or making `fields` navigable is a separate question, worth its own issue.

## Note for apify-mcp-server-internal

`get-dataset` `structuredContent` no longer carries the `schema` key, so the contract suite may need a matching update. Not a tool removal (so not breaking in that sense), but the response shape changed.

https://claude.ai/code/session_01Sf9wACoa9h9y2m2WZ2Sde5
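
For readers unfamiliar with the flat `fields` inventory this PR relies on, the following sketch shows one way dot-notation paths could be collected from sample items. `collectFieldPaths` is a hypothetical helper, and traversing arrays without an index segment is an assumption inferred from paths such as `openingHours.day` in the example response; the server's actual field normalization may differ.

```typescript
// Collect dot-notation paths from sample items.
// ASSUMPTION: arrays are traversed without index segments, so an array of
// objects under "openingHours" yields "openingHours.day", not "openingHours.0.day".
function collectFieldPaths(items: unknown[]): string[] {
  const paths = new Set<string>();

  const walk = (value: unknown, prefix: string): void => {
    if (Array.isArray(value)) {
      // Empty arrays still count as a field; otherwise descend with the same prefix.
      if (value.length === 0 && prefix) paths.add(prefix);
      for (const element of value) walk(element, prefix);
      return;
    }
    if (value !== null && typeof value === "object") {
      const entries = Object.entries(value as Record<string, unknown>);
      if (entries.length === 0 && prefix) paths.add(prefix);
      for (const [key, child] of entries) {
        walk(child, prefix ? `${prefix}.${key}` : key);
      }
      return;
    }
    // Primitive leaf (including null): record the path that led here.
    if (prefix) paths.add(prefix);
  };

  for (const item of items) walk(item, "");
  return Array.from(paths);
}
```

Because each collected path is already in the form that `fields=` consumes, a list like this can be passed on for projection without consulting the full schema.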